Chemical Properties of White Wine That Influence Perceived Sensory Quality


Explore the Data Structure

Column Vectors

##  [1] "fixed.acidity"                "volatile.acidity"            
##  [3] "citric.acid"                  "residual.sugar"              
##  [5] "chlorides"                    "free.sulfur.dioxide"         
##  [7] "total.sulfur.dioxide"         "density"                     
##  [9] "pH"                           "sulphates"                   
## [11] "alcohol"                      "quality"                     
## [13] "quality_lev_f"                "fixed_to_volatile_acid_level"
## [15] "sugar_to_sulfates_ratio"

The dataset contains 4,898 white wines, each described by 11 variables quantifying its chemical properties plus an expert quality rating. Three derived columns, described later in the analysis, bring the total to 15. Comparing the number of rows and columns captured by R against the number of rows and columns in the CSV confirms that the entire file was read. Wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). All columns except for “quality” and the “quality_lev_f” factor contain decimal values.

## Rows: 4,898
## Columns: 15
## $ fixed.acidity                <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, …
## $ volatile.acidity             <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32…
## $ citric.acid                  <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16…
## $ residual.sugar               <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.0…
## $ chlorides                    <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050…
## $ free.sulfur.dioxide          <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, …
## $ total.sulfur.dioxide         <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 13…
## $ density                      <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, …
## $ pH                           <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18…
## $ sulphates                    <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47…
## $ alcohol                      <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8…
## $ quality                      <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7…
## $ quality_lev_f                <fct> above avg., above avg., above avg., abov…
## $ fixed_to_volatile_acid_level <dbl> 25.926, 21.000, 28.929, 31.304, 31.304, …
## $ sugar_to_sulfates_ratio      <dbl> 46.000, 3.265, 15.682, 21.250, 21.250, 1…

Statistical Summary of Column Vectors

Below Average and Exceptional wines are nearly equal in number (183 and 180). The majority of wines fall within the Above Average quality category (3,078), followed by the Average quality category (1,457). The variables (excluding ratio variables) with the largest ranges are total.sulfur.dioxide, free.sulfur.dioxide and residual.sugar, which may indicate outliers in these variables.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000  
##      quality_lev_f  fixed_to_volatile_acid_level sugar_to_sulfates_ratio
##  below avg. : 183   Min.   : 5.545               Min.   : 1.169         
##  avg.       :1457   1st Qu.:20.628               1st Qu.: 3.571         
##  above avg. :3078   Median :26.071               Median :10.932         
##  exceptional: 180   Mean   :27.657               Mean   :13.732         
##                     3rd Qu.:33.000               3rd Qu.:21.244         
##                     Max.   :90.000               Max.   :95.362
##                                  min       max
## fixed.acidity                3.80000  14.20000
## volatile.acidity             0.08000   1.10000
## citric.acid                  0.00000   1.66000
## residual.sugar               0.60000  65.80000
## chlorides                    0.00900   0.34600
## free.sulfur.dioxide          2.00000 289.00000
## total.sulfur.dioxide         9.00000 440.00000
## density                      0.98711   1.03898
## pH                           2.72000   3.82000
## sulphates                    0.22000   1.08000
## alcohol                      8.00000  14.20000
## quality                      3.00000   9.00000
## fixed_to_volatile_acid_level 5.54500  90.00000
## sugar_to_sulfates_ratio      1.16900  95.36200
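
As a quick check on the ranking of ranges quoted earlier, the spread of each variable can be recomputed from the minimum and maximum values in the table above. This is a minimal sketch that retypes the printed values rather than reading the data file; the object names `limits` and `spread` are illustrative and are not taken from the original analysis.

```r
# Min and max per chemical variable, copied from the range table above
# (ratio variables are left out, as in the text).
limits <- rbind(
  fixed.acidity        = c(3.8, 14.2),
  volatile.acidity     = c(0.08, 1.10),
  citric.acid          = c(0.00, 1.66),
  residual.sugar       = c(0.6, 65.8),
  chlorides            = c(0.009, 0.346),
  free.sulfur.dioxide  = c(2, 289),
  total.sulfur.dioxide = c(9, 440),
  density              = c(0.98711, 1.03898),
  pH                   = c(2.72, 3.82),
  sulphates            = c(0.22, 1.08),
  alcohol              = c(8.0, 14.2)
)
colnames(limits) <- c("min", "max")

# Range = max - min, sorted from widest to narrowest.
spread <- sort(limits[, "max"] - limits[, "min"], decreasing = TRUE)
print(head(spread, 3))
```

Raw ranges depend on each variable's units, so a wide range marks a variable worth inspecting rather than proving that it contains outliers.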


Univariate Plots Section

Fixed Acidity Histogram

Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Fixed Acidity values lie between 6.3 and 7.3.

Volatile Acidity Histogram

Transformed the data to remove the long tails and reveal a mostly unimodal distribution with a secondary spike at 0.20. The middle 50% of the Volatile Acidity values lie between 0.21 and 0.32.

Citric Acid Histogram

Transformed the data to remove the long tails and reveal a mostly unimodal distribution with a secondary spike at 0.50. The middle 50% of the Citric Acid values lie between 0.27 and 0.39.

Residual Sugar Histogram

Transformed the data to remove the long tails and reveal what can best be described as a bimodal distribution with peaks around 1.25 and 1.75. The middle 50% of the Residual Sugar values lie between 1.70 and 9.90. Before removing the tails, there was a fairly even distribution for sugar values of 2 to 5 where the frequency was between 50 and 200.

Chlorides Histogram and Boxplot

Transformed the data to remove the long tails and reveal a unimodal distribution with a peak around 0.0425.

Free Sulfur Dioxide Histogram

Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Free Sulfur Dioxide values lie between 23.0 and 46.0.

Total Sulfur Dioxide Histogram

Transformed the data to remove most of the long tails and reveal a mostly unimodal distribution. The middle 50% of the Total Sulfur Dioxide values lie between 108.0 and 167.0.

Wine Density Histogram and Boxplot

Transformed the data to remove the long tails. The middle 50% of the Density values lie between 0.9917 and 0.9961.

pH Histogram

Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the pH values lie between 3.09 and 3.28.

Sulfates Histogram

Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Sulfates values lie between 0.41 and 0.55.

Alcohol Histogram and Boxplot

Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Alcohol values lie between 9.50 and 11.40.

Quality Histogram and Boxplot

Transformed the data to remove the long tails and reveal a unimodal distribution. The majority of the Quality values lie between 5 and 7.

Fixed-To-Volatile Acid Level Histogram

Transformed the data to remove the long tails and reveal a unimodal distribution. The middle 50% of the Fixed-To-Volatile Acid values lie between 20.628 and 33.000.

Residual Sugar To Sulfates Level Histogram

The histogram reveals a long-tailed unimodal distribution. The middle 50% of the Residual Sugar To Sulfates Level values lie between 3.571 and 21.244.


Univariate Analysis

Data Set Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

What is the structure of your dataset?

The dataset contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates” and “alcohol”) plus the “quality” rating. It’s worth noting that there were no wines in the data set with qualities of 0, 1, 2 or 10. The “Quality” attribute is a sensory output derived from the median of at least 3 evaluations made by wine experts, while all other attributes are physicochemical inputs. Excluding the Alcohol attribute, all variables have outliers above their interquartile range.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is the “Quality” output variable. The goal is to determine which input variable(s) have the greatest influence on the perceived wine quality. I suspect Alcohol and pH are two of the leading contributors to perceived quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Notes about the data set indicate that several of the attributes may be correlated. I believe there’s likely a correlation between pH, Fixed Acidity, Volatile Acidity and Citric Acid. I also believe there may be a correlation between Alcohol and Residual Sugar.

Did you create any new variables from existing variables in the dataset?

The “quality” variable was used to create 4 Quality-Level factors for a “quality_lev_f” variable. The 4 factors are: “below avg.”, “avg.”, “above avg.” and “exceptional”. Below Average wines had quality levels from 0 to 4. Average wines had a quality level of 5. Above Average wines had quality levels of 6 and 7. Exceptional wines had quality levels from 8 to 10.
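
The binning described above could be reproduced with base R's `cut()`. This is a sketch under the assumption that the factor was built from the integer `quality` column with exactly these boundaries; the original code is not shown in this report, so the call below is illustrative only.

```r
# One example score per quality value observed in the data (3 through 9).
quality <- 3:9

# Right-closed bins: (-Inf,4], (4,5], (5,7], (7,Inf]
quality_lev_f <- cut(
  quality,
  breaks = c(-Inf, 4, 5, 7, Inf),
  labels = c("below avg.", "avg.", "above avg.", "exceptional")
)

print(data.frame(quality, quality_lev_f))
```

Using open-ended outer breaks means scores of 0, 1, 2 or 10 would still be assigned a level if they ever appeared.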

Another variable called “fixed_to_volatile_acid_level” was created to investigate if the balance between Fixed and Volatile Acids plays any role in perceived quality. Likewise another variable called “sugar_to_sulfates_ratio” was created to investigate if the balance between Residual Sugar and Sulfates plays any role in perceived quality.
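
The two ratio columns appear to be simple quotients of existing columns. The sketch below recomputes them for the first four wines shown in the data preview and compares the result with the previewed values. Rounding to three decimals is an assumption based on how the values are printed, and the data frame name `first_rows` is hypothetical.

```r
# First four rows of the relevant columns, copied from the data preview.
first_rows <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23),
  residual.sugar   = c(20.70, 1.60, 6.90, 8.50),
  sulphates        = c(0.45, 0.49, 0.44, 0.40)
)

# Derived ratios (rounding to 3 decimals is an assumption).
first_rows$fixed_to_volatile_acid_level <-
  round(first_rows$fixed.acidity / first_rows$volatile.acidity, 3)
first_rows$sugar_to_sulfates_ratio <-
  round(first_rows$residual.sugar / first_rows$sulphates, 3)

print(first_rows[, c("fixed_to_volatile_acid_level", "sugar_to_sulfates_ratio")])
```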

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The most unusual distribution thus far is the Chlorides distribution. The interquartile range is only 0.014 and there are a significant number of outliers above the interquartile range. Many other variables have outliers, so I expect to need to limit data based on quartiles to reduce outliers or use a log10 transformation. Most transforms performed will serve to eliminate distribution tails or reveal what sort of distribution a variable has.
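
To make the Chlorides observation concrete, the sketch below applies the common 1.5 × IQR boxplot rule to the quartiles printed in the statistical summary. Whether the original analysis used this exact rule is an assumption, and `trim_to_fences` is a hypothetical helper rather than a function from the report.

```r
# Quartiles and maximum for chlorides, copied from the summary output.
q1 <- 0.036
q3 <- 0.050
max_chlorides <- 0.346

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged as high outliers
lower_fence <- q1 - 1.5 * iqr   # values below this are flagged as low outliers
print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))

# Hypothetical helper: keep only the values inside the fences.
trim_to_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  margin <- 1.5 * (q[2] - q[1])
  x[!is.na(x) & x >= q[1] - margin & x <= q[2] + margin]
}

# Alternative: compress the right tail with log10 instead of dropping rows.
print(log10(c(q1, q3, max_chlorides)))
```

With these numbers the upper fence sits at 0.071, far below the maximum of 0.346, which is consistent with the many high outliers noted above.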


Bivariate Plots Section

Wine Quality and Independent Variable Adjusted R-Squared Values

Based on the calculations from the Linear Regression models and R2 values, there are a few key independent variables that explain the variance in Wine Quality. The top 5 variables and their adjusted R2 values are ranked in descending order as follows:

  1. Alcohol = 0.18956
  2. Density = 0.09414
  3. Chlorides = 0.04388
  4. Volatile Acidity = 0.03772
  5. Total Sulfur Dioxide = 0.03034

We’ll continue to focus on what relationships these variables have with wine quality.

## [1] "adjusted R2= 0.01272 fixed.acidity"
## [1] "adjusted R2= 0.03772 volatile.acidity"
## [1] "adjusted R2= -0.00012 citric.acid"
## [1] "adjusted R2= 0.00932 residual.sugar"
## [1] "adjusted R2= 0.04388 chlorides"
## [1] "adjusted R2= -0.00014 free.sulfur.dioxide"
## [1] "adjusted R2= 0.03034 total.sulfur.dioxide"
## [1] "adjusted R2= 0.09414 density"
## [1] "adjusted R2= 0.00968 pH"
## [1] "adjusted R2= 0.00268 sulphates"
## [1] "adjusted R2= 0.18956 alcohol"
## [1] "adjusted R2= 1 quality"
## [1] "adjusted R2= 0.83039 quality_lev_f"
## [1] "adjusted R2= 0.0206 fixed_to_volatile_acid_level"
## [1] "adjusted R2= 0.00648 sugar_to_sulfates_ratio"
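
A listing like the one above can be produced by fitting one simple linear regression per predictor and reading the adjusted R-squared from each fit. The sketch below is an assumption about the approach, not the report's actual code. Because the wine data frame is not bundled here, it uses R's built-in `mtcars` data as a stand-in, and the function name `adjusted_r2_by_predictor` is hypothetical.

```r
# Fit response ~ predictor for every other column and collect adjusted R-squared.
adjusted_r2_by_predictor <- function(df, response) {
  predictors <- setdiff(names(df), response)  # never regress the response on itself
  vapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = df)
    summary(fit)$adj.r.squared
  }, numeric(1))
}

# Stand-in data: mtcars ships with R and is NOT the wine data.
r2 <- adjusted_r2_by_predictor(mtcars, response = "mpg")
ranked <- sort(r2, decreasing = TRUE)
print(round(head(ranked, 5), 5))
```

For the wine data, the response and any column derived from it (such as “quality_lev_f”) would be left out of the predictors, since they explain quality trivially.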

Correlation Chart

There are too many variables at play in the above Correlation Chart, so let’s focus on variables with a strong correlation with Wine Quality. There’s a strong positive or negative correlation between Alcohol and most other variables except for Volatile Acidity. The strongest positive correlation with Wine Quality is that of Alcohol. The strongest negative correlation with Wine Quality is that of Wine Density.

Wine Quality and Alcohol Lineplot

There’s a smaller alcohol range for Quality 3 and Exceptional wines. Above Average quality wines have the largest Alcohol content range. Average quality wines tended to have an Alcohol content range between 9 and 10. Exceptional Quality Wines had an ABV between 11 and 13.

Wine Quality and Wine Density Lineplot

The mean density appears to decrease as the wine quality increases. Average quality wines have the largest variability in density. Quality 3 and Exceptional wines have the smallest density range.

Wine Quality and Total Sulfur Dioxide Lineplot

The Total Sulfur Dioxide level appears to decrease as the quality increases. All quality levels except for Quality 9 have a large range in values. However, the maximum Total Sulfur Dioxide for Exceptional wines is much lower than that of lower quality wines.

Wine Quality and Chlorides Lineplot

The Chloride level decreases as the wine quality increases. The Chloride range between the different quality levels is pretty narrow. Despite its lower ranking in the Top 5 R-Squared values, this narrow range may indicate that the amount of Chlorides goes a long way in affecting a wine’s quality.

Wine Quality and Volatile Acidity Lineplot

There are similar Volatile Acid levels between Below Average and Exceptional Wines. Above Average Wines have a similar mean Volatile Acidity of around 0.25. Excluding Quality 3 and 9 Wines, all quality levels seem to have a similar range in values. However, I would assume that with more data points for Quality 3 and 9 Wines, we’d likely see a similarly large range.

Wine Quality and pH Distribution Averages

There’s not a significant difference in pH values between wine quality levels with most values being between 3.1 and 3.25 pH. Average and Above Average wines had the largest pH ranges.

Wine Quality and Citric Acid Distribution Averages

Exceptional Quality Wines and Quality 3 Wines have the smallest range of Citric Acid values. Average and Above Average Wines have the largest range of Citric Acid values.

Quality and pH Percentiles

The 1st and the 99th Percentile for pH values become significantly smaller for Exceptional Wines.

Quality and Sugar to Sulphate Ratio Percentile Lineplot

Exceptional quality wines show a narrower Sugar to Sulfate Ratio range than the other quality levels.

Quality and Alcohol Lineplot

Average quality wines tend to be between 9% and 11% ABV. ABV tends to increase linearly with wine quality.

Quality and Total Sulfur Dioxide Percentiles

The 99th Percentile for Total Sulfur Dioxide drops significantly for Exceptional wines and generally the mean Total Sulfur Dioxide appears to decrease for Above Average wines.
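
Per-level percentiles of this kind can be computed with `quantile()` applied to each quality level. The sketch below is illustrative only: it uses evenly spaced placeholder values that span the minimum and maximum this report lists for the Below Average and Exceptional levels. They are not real observations, and the names `toy` and `pct` are hypothetical.

```r
# Placeholder values spanning the reported min-max of each level.
# These are NOT real observations from the wine data.
toy <- data.frame(
  total.sulfur.dioxide = c(seq(10, 440, length.out = 50),     # below avg.
                           seq(59, 212.5, length.out = 50)),  # exceptional
  quality_lev_f = factor(rep(c("below avg.", "exceptional"), each = 50))
)

# 1st and 99th percentile per quality level (rows: percentiles, columns: levels).
pct <- sapply(
  split(toy$total.sulfur.dioxide, toy$quality_lev_f),
  quantile, probs = c(0.01, 0.99)
)
print(round(pct, 1))
```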

Marginal Histograms

Quality and Alcohol Density Plot

The linear model indicates that the alcohol level generally increases as the quality increases. There are significantly more Quality 8 Exceptional Wines than there are Quality 9 Exceptional Wines. The Quality 9 wines are most likely to be around 13% ABV. Qualities 4-7 have a generally even distribution of alcohol levels ranging from 8.5% to 13%. The majority of the Above Average Quality wines are of Quality 6 at 11% ABV.

Below Average and Exceptional Wine Quality Compared to Alcohol and Wine Density

As density decreases and alcohol increases, so does the quality of wine.

Below Average and Exceptional Wine Quality Compared to Total Sulfur Dioxide and Chlorides

Exceptional Wines are below 0.06 Chloride and are generally between 84 and 134 Total Sulfur Dioxide.

Below Average and Exceptional Wine Quality Compared to Total Sulfur Dioxide and Volatile Acidity

There’s a lot of similarity between Total Sulfur Dioxide and Volatile Acidity for both Below Average and Exceptional Wines. Exceptional wines tended to have slightly more clustering in their values, but the difference from Below Average wines is negligible.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The most significant relationships related to quality were the relationship between Quality and Alcohol (0.436) and the negative relationship between Quality and Density (-0.307). The mean density appears to decrease as the wine quality increases and inversely, the alcohol level increases as the wine quality increases.
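
These correlations line up with the adjusted R-squared values reported earlier. For a regression with a single predictor, R-squared equals the squared correlation, and the adjusted value shrinks it slightly by a factor that depends on the sample size. The sketch below runs that arithmetic on the rounded figures quoted in this report, so small differences from rounding are expected.

```r
n <- 4898                                     # number of wines
r <- c(alcohol = 0.436, density = -0.307)     # correlations with quality, as quoted above

r2 <- r^2                                     # R-squared for a one-predictor model
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)    # adjusted R-squared with one predictor

reported <- c(alcohol = 0.18956, density = 0.09414)  # from the adjusted R2 listing
print(round(rbind(computed = adj_r2, reported = reported), 5))
```

Both computed values fall within 0.001 of the listed ones, which suggests the two sets of figures describe the same relationships.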

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Three of the most interesting and unexpected relationships were those between Total Sulfur Dioxide and Density (0.53), Total Sulfur Dioxide and Alcohol (-0.499), and Alcohol and Density (-0.78). The first may indicate that denser wines can hold more Total Sulfur Dioxide in solution and/or that the Total Sulfur Dioxide in solution contributes to density. The relationship between Total Sulfur Dioxide and Alcohol may indicate that Sulfur Dioxide dissipates as the Alcohol level increases with fermentation. The Alcohol and Density relationship seems to indicate that Alcohol decreases density, since alcohol is less dense than water and residual sugar.

What was the strongest relationship you found?

The strongest relationship was the negative relationship between Alcohol and Density with a correlation value of -0.78. Overall Alcohol appears to correlate with many variables.


Multivariate Plots Section

Below Average and Exceptional Wine Quality and the Relationship Between pH and Citric Acid

Below Average and Exceptional Wine Quality and the Relationship Between Alcohol and Wine Density

Exceptional quality wines tend toward higher alcohol and lower density.

Below Average and Exceptional Wine Quality and the Relationship Between Alcohol and Residual Sugars

Below Average wines have a bit of clustering around the low end of Residual Sugars, but their Alcohol Range is wide. Above Average wines favor higher alcohol and generally have Residual Sugar values below 10.

Wine Quality with Alcohol and pH Cross Comparison

All wine quality levels tend to have the same range of pH values. Average wines tend to cluster around 3.0 to 3.3 pH levels and 9% to 10% ABV. Above Average wines have the greatest range of values for both pH and Alcohol levels.

Wine Quality with Residual Sugar and pH Cross Comparison

Most wines tend to cluster around low values (2 to 3) for Residual Sugar yet have an even range of pH values.

Wine Quality with Alcohol and Residual Sugar Cross Comparison

Most wines tend to cluster around low values (2 to 3) for Residual Sugar. Average and Above Average wines tend to have alcohol levels between 9 and 12.5.

Wine Quality with Residual Sugar and Total Sulfur Dioxide Cross Comparison

Most wines tend to cluster around low values (2 to 3) for Residual Sugar and share a similar range of 50 to 200 for their Total Sulfur Dioxide.

Summarize Alcohol for Quality Levels

Higher quality wines tended to be on the upper end of the alcohol range, with values between 11.40 and 12.18.

## white_wine$quality_lev_f: below avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.10   10.17   10.80   13.50 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: above avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.8    10.8    10.8    11.8    14.2 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: exceptional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.65   12.60   14.00
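
Output in this layout is what base R's `by()` prints when `summary` is applied to one column within each level of a factor. The sketch below shows the pattern on a tiny placeholder data frame: each level gets three values borrowed from the quartiles printed above, so the medians match the report but the other statistics will not. The data frame name `toy` is hypothetical.

```r
lev <- c("below avg.", "avg.", "above avg.", "exceptional")

# Placeholder rows: 1st quartile, median and 3rd quartile of each level,
# copied from the output above. Not the real observations.
toy <- data.frame(
  alcohol = c( 9.4, 10.1, 10.8,   # below avg.
               9.2,  9.5, 10.3,   # avg.
               9.8, 10.8, 11.8,   # above avg.
              11.0, 12.0, 12.6),  # exceptional
  quality_lev_f = factor(rep(lev, each = 3), levels = lev)
)

# Six-number summary of alcohol within each quality level.
print(by(toy$alcohol, toy$quality_lev_f, summary))

# The same grouping reduced to a single statistic per level.
medians <- tapply(toy$alcohol, toy$quality_lev_f, median)
print(medians)
```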

Summarize Density for Quality Levels

## white_wine$quality_lev_f: below avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9960  1.0004 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0024 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: above avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9912  0.9930  0.9935  0.9955  1.0390 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: exceptional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0006

Summarize fixed_to_volatile_acid_level for Quality Levels

## white_wine$quality_lev_f: below avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.545  14.891  22.188  22.854  27.600  60.588 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.974  19.697  23.871  25.328  29.200  69.286 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: above avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.987  21.481  27.600  29.043  35.263  90.000 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: exceptional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.091  19.229  26.727  27.681  35.870  61.538

Summarize sugar_to_sulfates_ratio for Quality Levels

The minimum Residual Sugar to Sulfate Level is the highest for Quality 9 wines. The mean Residual Sugar to Sulfate Level is lowest for Quality 9 wines. The maximum Residual Sugar to Sulfate Level is lower for both lower (Quality 3 and 4) and the highest quality wine. The Residual Sugar to Sulfate Level range is smallest for Quality 9 wines.

## white_wine$quality_lev_f: below avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.628   3.056   5.294  10.387  15.819  35.000 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.29    3.81   14.81   15.61   23.48   64.57 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: above avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.169   3.556  10.000  13.077  19.952  95.362 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: exceptional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.368   3.825   8.757  13.113  21.297  48.333

Summarize pH for Quality Levels

Excluding wines of quality 3 and 4, the mean pH value increases as the quality of wine increases. The max pH level for wines of quality 8 and 9 drops significantly compared to qualities 3 through 7.

## white_wine$quality_lev_f: below avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.060   3.160   3.183   3.285   3.720 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: above avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.190   3.196   3.290   3.820 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: exceptional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.127   3.230   3.221   3.330   3.590

Summarize Total Sulfur Dioxide for Quality Levels

The lowest quality wine has over 3 times the maximum Total Sulfur Dioxide of the highest rated wine. Excluding Quality 5 wines, the max Total Sulfur Dioxide tended to decrease as the quality increased.

## white_wine$quality_lev_f: below avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    85.5   119.0   130.2   177.0   440.0 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   121.0   151.0   150.9   182.0   344.0 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: above avg.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0   105.0   129.0   133.6   159.0   294.0 
## ------------------------------------------------------------ 
## white_wine$quality_lev_f: exceptional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.0   102.8   122.0   125.9   148.5   212.5

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The most notable relationship was that of Wine Quality, Alcohol and Density. Exceptional quality wines tend toward higher alcohol and lower density, and Below Average wines tend toward lower alcohol and higher density.

The relationship between Quality, Alcohol and Residual Sugars was less definitive. Below Average wines have a bit of clustering around the low end of Residual Sugars, but their Alcohol Range is wide. Above Average wines favor higher alcohol and generally have Residual Sugar values below 10.

There weren’t any significant relationships between Quality, pH and Citric Acid. Values for both Below Average and Exceptional wines were seemingly equally distributed.

There weren’t any significant relationships between Quality, Alcohol and pH. All wine quality levels tend to have the same range of pH values. Average wines tend to cluster around 3.0 to 3.3 pH levels and 9% to 10% ABV. Above Average wines have the greatest range of values for both pH and Alcohol levels.

Were there any interesting or surprising interactions between features?

In many cases there are quite a few similarities between the highest and lowest quality wines. The Average and Above Average wines offered the greatest range of values for most variables. There were interesting discrepancies between a few of the Quality 8 and 9 wines. Quality 9 wines tended to see higher alcohol levels, lower Total Sulfur Dioxide, lower Sugar-To-Sulfate Ratios and an increased mean pH value.


Final Plots and Summary


Plot 1

Plot 1 Description

The distribution above shows a trend of the Chloride level decreasing as the wine quality increases. The most significant decrease is from 0.040 for Quality 8 wines down to 0.0325 for Quality 9 wines. The difference in Chloride range between the different quality levels is seemingly negligible. However, despite Chlorides’ lower ranking in the Top 5 R-Squared values, this narrow range of approximately 0.025 may indicate that the amount of Chloride goes a long way in affecting a wine’s flavor and quality.

Plot 2

Plot 2 Description

There’s a lot of similarity between Total Sulfur Dioxide and Volatile Acidity for both Below Average and Exceptional Wines. However, Exceptional wines tended to have slightly more clustering in their values in both Total Sulfur Dioxide and Volatile Acidity. The Total Sulfur Dioxide values for Exceptional wines form a near-normal distribution, whereas the Volatile Acidity values form a looser normal distribution with a higher variability in values. The Below Average wines have longer tails in their Total Sulfur Dioxide distribution, but otherwise the difference between the two groups is negligible.

Plot 3

Plot 3 Description

As density decreases and alcohol increases, the quality of wine increases. The balance between Alcohol and Density appears to be a key determinant of wine quality. Below Average wines tended to have a higher density and Exceptional wines had a lower density. The opposite was true for Alcohol levels: as Alcohol level increased, so did the wine quality. Below Average wines tended to have a broader range of Alcohol and Density values, whereas the Exceptional wines tended to cluster around the 12% to 13% ABV and 0.985 to 0.9925 Density levels. It is worth pointing out that it’s unclear if wines received a higher quality because of their higher ABV, because of their lower density or because of a combination of both factors.


Reflection

Struggles

Some of the major struggles with this analysis were determining which variables to compare, determining how and when to transform or limit vectors and determining which type of plots or functions conveyed information in the clearest manner. Also, determining which R packages provided which statistical and plotting functionality made the initial investigation progress slowly. Early on, one struggle was hoping to find a strong correlation between Wine Quality and certain variables (pH and Citric Acid specifically) when there simply wasn’t a strong correlation compared to other variables. Lastly, which additional ratios or calculations would serve as useful column vectors to explore was unclear until re-examining the variable definitions and variable relationships.

Successes

The largest success came from using Adjusted R-Squared values and GGPairs to help isolate which variable relationships were worth exploring. These tools also highlighted whether a positive or negative correlation existed between the variables and helped explain interactions and trends visible on plots.

Grouping the numerical Quality vector into descriptive factors made it easier to group wines by quality using common quality descriptors and made it easier to see trends between different wine quality groupings.

Using Marginal Histograms was also immensely beneficial in communicating distributions, scatterplot clustering and patterns between the wine quality levels.

Lastly, realizing there was a strong correlation between Alcohol and Density helped drive a comparison between Alcohol, Density and Quality. This ultimately unveiled a strong relationship between the 3 variables, which makes sense in retrospect: as the density or sugars decrease, the amount of alcohol increases.

Future Exploration

There’s plenty of room for further exploration and modeling. As with most analyses, a larger data set, specifically with more Below Average and Exceptional wines, would tease out trends and smooth out averages. There was plenty of data for Average Quality wine but fewer data points for the tails of the Quality Distribution. Also, additional chemical compounds and variables such as grape varietal(s), age, price, brand, and region would be interesting to explore.